Homework 3¶

The goal of Homework 3 is to study the LIME explanation method. This method explains the model's predictions locally, i.e. for a single instance from the dataset. First, it perturbs the observation of interest using random noise (adapted for each feature independently) and obtains the model's predictions for these modified versions of the original instance, thus probing the model's behaviour in a local neighborhood of this observation. Then, it attempts to approximate the relationship between the model's output and the generated variations with a linear model. The weights of this trained linear model provide a measure of importance for each feature, which is treated as the explanation for the model's prediction.
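The perturb-then-fit idea described above can be sketched in a few lines of code. Below is a minimal, self-contained illustration — not the actual `lime` implementation; the Gaussian sampling, the exponential kernel width and the `Ridge` surrogate are simplifying assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(model_fn, x, n_samples=1000, kernel_width=2.0, seed=0):
    """Toy LIME: explain model_fn's prediction at the point x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the observation of interest with random noise
    Z = x + rng.normal(size=(n_samples, x.shape[0]))
    # 2. Query the black-box model on the perturbed samples
    y = model_fn(Z)
    # 3. Weight samples by proximity to x (exponential kernel)
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-dist ** 2 / kernel_width ** 2)
    # 4. Fit a weighted linear surrogate; its coefficients are the explanation
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_

# Toy black box that depends only on the first feature
coefs = lime_sketch(lambda Z: 3.0 * Z[:, 0], x=np.zeros(4))
```

On this toy example the surrogate's first coefficient dominates, mirroring the black box's reliance on the first feature.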

To properly study the LIME explanation method, I have reused the churn dataset from Homework 1 together with two models trained on a single train/test split: a multilayer perceptron and a random forest (the best-performing models from the analysis shown in Homework 1). The churn dataset poses a simple task of predicting whether a customer churned based on 8 numerical variables indicating the number of minutes spent on phone conversations and how much the client was charged for them (divided into 4 categories: day, evening, night and international). Because these features are easily interpretable, we can judge whether the chosen model behaves according to our intuition.

Below, visualizations of LIME explanations for 5 observations (taken from the test split) and both models are presented. I will treat the random_forest model as the baseline and the multilayer perceptron as an additional model for comparison.

  1. How stable are these explanations?

Considering the random_forest model, these explanations seem to be stable across the 5 analysed instances. The most important feature for each observation is total_day_charge, which seems reasonable: whether a customer churned should be highly correlated with the amount of money spent on calls throughout the day, the busiest part of the 24-hour period. Note that, as expected due to class imbalance, the model predicts that the client did not churn in each case, although with varying certainty, and total_day_charge influences this decision in both directions (pushing towards churn for some instances and against it for others). The second most important feature is total_day_minutes (it ranks second in 4 out of 5 observations). Once again, its importance seems reasonable, as the client's phone usage during the most active part of the day should matter for this decision. The remaining features appear far less important globally (across all observations) compared to the consistent top ranking of these two.
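One simple way to quantify this kind of stability is to re-run the explanation several times with different random seeds and check whether the top-ranked feature stays the same. The sketch below does this for a toy perturbation-based surrogate (the sampling scheme and `Ridge` surrogate are simplifying assumptions, not the `lime` internals):

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain(model_fn, x, seed):
    # Toy perturbation-based surrogate explanation
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(size=(500, x.shape[0]))
    return Ridge(alpha=1.0).fit(Z, model_fn(Z)).coef_

# Toy black box with one clearly dominant feature
black_box = lambda Z: 2.0 * Z[:, 0] + 0.5 * Z[:, 1]
x = np.zeros(3)
# Re-run the explanation with different seeds and record the top feature
top_features = [np.argmax(np.abs(explain(black_box, x, seed=s))) for s in range(10)]
stability = top_features.count(top_features[0]) / len(top_features)
```

When the model's local behaviour is dominated by one feature, as here, the top-ranked feature is the same across all seeds; a lower agreement fraction would signal unstable explanations.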

  2. Compare LIME between at least two different models. Are there any systematic differences across many observations?

Interestingly, the LIME explanation method indicates that the multilayer perceptron model bases its predictions on a different set of features than random_forest. Before analysing that in detail, it is important to highlight that both models achieved a test ROC AUC of around 0.7 and differ slightly in PR AUC: 0.41 for random_forest vs. 0.43 for mlp. Thus, if the choice of a better model were based on metrics alone, we would probably pick the mlp model. However, a deeper analysis of the LIME explanations might show that we should not trust this model's predictions as much as those of random_forest. Indeed, the mlp model treats total_intl_charge as the most important feature in 2 out of 5 observations and total_day_charge as the most important in the other 3. This points to an interesting pattern learned by mlp that is not as clearly visible in random_forest: the amount of money spent by the client on international calls has a big influence on whether they churned. In general, features connected with international calls rank higher in importance for the mlp model. The fact that mlp uses these features differently might explain its slightly higher PR AUC. Intuitively, however, the importance of this feature might be connected with the class imbalance: the clients who did churn were probably mostly those who focused on international calls, and both of these groups are usually small compared to the whole population. It is, therefore, clear that these models exhibit different internal decision-making processes, and the choice between them should take into account conclusions drawn from an analysis of explanations such as those produced by LIME.
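Such systematic differences can also be checked programmatically, by averaging the absolute LIME weights per feature over many observations for each model and comparing the resulting rankings. A minimal sketch over hypothetical, hand-made weights (in practice these would be collected from each instance's explanation, e.g. via `as_list()`):

```python
import numpy as np

def mean_importance(weights_per_obs):
    """Average absolute feature weight across observations: feature -> score."""
    features = weights_per_obs[0].keys()
    return {f: np.mean([abs(w[f]) for w in weights_per_obs]) for f in features}

# Hypothetical per-observation weights for two models (illustration only)
rf_weights = [{'total_day_charge': 0.30, 'total_intl_charge': 0.05},
              {'total_day_charge': 0.25, 'total_intl_charge': 0.08},
              {'total_day_charge': -0.28, 'total_intl_charge': 0.04}]
mlp_weights = [{'total_day_charge': 0.15, 'total_intl_charge': 0.22},
               {'total_day_charge': 0.18, 'total_intl_charge': -0.25},
               {'total_day_charge': 0.20, 'total_intl_charge': 0.19}]

# Top-ranked feature per model after aggregation
rf_top = max(mean_importance(rf_weights), key=mean_importance(rf_weights).get)
mlp_top = max(mean_importance(mlp_weights), key=mean_importance(mlp_weights).get)
```

Aggregating over observations like this makes per-model differences (here, the mlp-style emphasis on international charges) visible at a glance rather than requiring a plot-by-plot comparison.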

Thanks to its simplicity, the LIME explanation method provides easy-to-understand explanations, since a local linear surrogate is often a reasonable approximation of the model's behaviour in a small neighborhood (although the quality of this fit should be verified rather than taken for granted). The simple exploratory analysis of the two models above shows that incorporating explanations into the process of choosing 'a better model' can be very valuable.

In [ ]:
make_explainers_plots()
Observation id: 0
Model: random_forest
Model: mlp
Observation id: 1
Model: random_forest
Model: mlp
Observation id: 2
Model: random_forest
Model: mlp
Observation id: 3
Model: random_forest
Model: mlp
Observation id: 4
Model: random_forest
Model: mlp
In [ ]:
for model_name, metrics_values in test_metrics.items():
    print(f'Model: {model_name}')
    # Use the unpacked dict instead of re-indexing test_metrics
    for metric_name, metric_val in metrics_values.items():
        print(f'{metric_name}: {metric_val}')
Model: random_forest
roc_auc: 0.7081380988703643
pr_auc: 0.40525210981430104
Model: mlp
roc_auc: 0.7090189241218988
pr_auc: 0.434033512877087

Appendix¶

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score
In [ ]:
PATH_DATASET = 'churn.csv'
DATASET = pd.read_csv(PATH_DATASET, index_col = 0)
In [ ]:
SEED = 0
In [ ]:
METRICS = {
    'roc_auc': roc_auc_score,
    'pr_auc': average_precision_score}
In [ ]:
MODELS = {
    'random_forest': RandomForestClassifier(random_state = SEED),
    'mlp': Pipeline([
        ('standard_scaler', StandardScaler()),
        ('mlp', MLPClassifier((32, 32), 'relu', random_state = SEED))])}
In [ ]:
def get_train_test_split():
    dataset_npy = DATASET.values
    x, y = dataset_npy[:, :-1], dataset_npy[:, -1]
    return train_test_split(x, y, test_size = 0.2, random_state = SEED)

def train_models():
    x_train, x_test, y_train, y_test = get_train_test_split()
    results = {model_name: {} for model_name in MODELS.keys()}
    trained_models = {}
    for model_name, model in MODELS.items():
        model.fit(x_train, y_train)
        trained_models[model_name] = model
        y_pred = model.predict_proba(x_test)[:, -1]
        for metric_name, metric in METRICS.items():
            results[model_name][metric_name] = metric(y_test, y_pred)
    print('Finished')
    return trained_models, results
In [ ]:
trained_models, test_metrics = train_models()
x_train, x_test, y_train, y_test = get_train_test_split()
Finished
/Users/bartlomiejsobieski/miniforge3/envs/mimuw_xai/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
In [ ]:
import numpy as np
import lime
import lime.lime_tabular
np.random.seed(SEED)
In [ ]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    x_train, 
    feature_names = DATASET.columns[:-1], 
    class_names = ['didnt churn', 'did churn'],
    discretize_continuous=True)
In [ ]:
idxs = [0, 1, 2, 3, 4]
explainers = {i: {} for i in idxs}
for idx in idxs:
    for model_name, model in trained_models.items():
        explainers[idx][model_name] = explainer.explain_instance(x_test[idx], model.predict_proba)
In [ ]:
def make_explainers_plots():
    for obs_id in explainers.keys():
        print(f'Observation id: {obs_id}')
        # Avoid shadowing the global LimeTabularExplainer named `explainer`
        for model_name, explanation in explainers[obs_id].items():
            print(f'Model: {model_name}')
            explanation.show_in_notebook(show_table = True, show_all = False)